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Non—parametric Inference in Astrophysics 

Larry Wasserman, Chris Miller, Bob Nichol, Chris Genovese, Woncheol 
Jang, Andy Connolly, Andrew Moore, Jeff Schneider and the PICA 

group. Q 

We discuss non-parametric density estimation and regression 
for astrophysics problems. In particular, we show how to com¬ 
pute non-parametric confidence intervals for the location and 
size of peaks of a function. We illustrate these ideas with recent 
data on the Cosmic Microwave Background. We also briefly 
discuss non-parametric Bayesian inference. 

1. Nonparametric Inference 

The explosion of data in astrophysics provides unique opportunities and 
challenges. The challenges are mainly in data storage and manipulation. 
The opportunities arise from the fact that large sample sizes make non- 
parametric statistical methods very effective. Nonparametric methods are 
statistical techniques that make as few assumptions as possible about the 
process that generated the data. Such methods are inherently more flexi¬ 
ble than more traditional parametric methods that impose rigid and often 
unrealistic assumptions. With large sample sizes, nonparametric methods 
make it possible to find subtle effects which might otherwise be obscured 
by the assumptions built into parametric methods. We begin by discussing 
two prototypical astrostatistics problems. 

Problem 1. Density Estimation. Let Ai,...,A„ denote the posi¬ 
tions of n galaxies in a galaxy survey. Let f(x)dx denote the probability 
of finding a galaxy in a small volume around x. The function / is a prob¬ 
ability density function, satisfying f{x) > 0 and f f(x)dx = 1. We regard 
Xi ,..., Xn as n random draws from /. Our goal is to estimate f{x) from 
the data {Xi ,..., A„) while making as few assumptions about / as pos¬ 
sible. Figure 1 shows redshifts from a pencil beam from the Sloan Digital 
Sky Survey. The figure shows several nonparametric density estimates that 
will be described in more detail in Section 3. The structure in the data is 
evident only if we smooth the data by just the right amount (lower left 
plot).0 

Problem 2. Regression. Figures 2 and 3 show cosmic microwave back¬ 
ground (CMB) data from BOOMERaNG (Netterfield et al. 2001), Maxima 
(Lee et al. 2001) and DASI (Halverson 2001). The data consist of n pairs 
(Xi, Yi),..., (Xyj, 1^). Here, Xi is multipole moment and Yi is the esti- 


^See www.picagroup.org for latest software, papers and memberships of the PICA 
group. 

^ The data involve selection bias since we can only observe brighter objects for larger 
redshifts. However, the sampling is fairly complete out to about z = 0.2. 
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mated power spectrum of the temperature fluctuations. If f{x) denotes the 
true power spectrum then 

= fix,) + e, 

where Ci is a random error with mean 0. This is the standard regression 
model. We call Y the response variable and X the covariate. Other com¬ 
monly used names for X include predictor and independent variable. The 
function / is called the regression function. The goal in nonparametric re¬ 
gression is to estimate / making only minimal smoothness assumptions 
about /. 

The main messages of this paper are: (1) with large data sets one can 
estimate a function / nonparametrically, that is, without assuming that / 
follows some given functional form; (2) one can use the data to estimate 
the optimal amount of smoothing; (3) one can derive confidence sets for 
/ as well as confidence sets for interesting features of /. The latter point 
is very important and is an example of where rigorous statistical methods 
are a necessity; the usual confidence intervals of the form “estimate plus or 
minus error” will not suffice. 


The outline of this paper is as follows. Section 2 discusses some concep¬ 
tual issues. Section 3 discusses kernel density estimation. Section 4 discusses 
nonparametric regression. Section 5 explains something that might be less 
familiar to astrophysicists, namely, nonparametric estimation via shrink¬ 
age. Section 6 discusses nonparametric confidence intervals. In Section 7 we 
briefly discuss nonparametric Bayesian inference. We make some concluding 
remarks in Section 8. 


Notation: We denote the mean of a random quantity X by E{X), of¬ 
ten written as (X) in physics. The variance of X is denoted by = 
Var{X) = EiX — i?(X))^. A random variable X has a Normal (or Gaus¬ 
sian) distribution with mean /x and variance denoted by AT ~ N(pL,a‘^), 
if 

Pr{a < V < 6) = ^ |_^(* _ 

We use / to denote an estimate of a function /. 

2. Some Conceptual Issues 


2.1. The Bias-Variance Tradeoff. In any nonparametric problem, 
we need to find methods that produce estimates / of the unknown function 
/. Obviously, we would like / to be close to /. We will measure closeness 
with squared error: 


LifJ)= I 


(fix) - fix))‘^dx. 



The average value of the error is called the risk or mean squared error 
(MSE) and is denoted by: 


R{fJ)=Ef L{fJ) 


A simple calculation shows that 


R{fJ) 


= / Bias:! dx + 


Var^; dx 


where Bias^; = E[f{x)] — f{x) is the bias of f{x) and Var^; = Var[f{x)] = 
~ R[fi^)])^] is the variance of f{x). In words: 

RISK = BIAS^ + VARIANCE. 


Every nonparametric method involves some sort of data-smoothing. The 
difhcult task in nonparametric inference is to determine how much smooth¬ 
ing to do. When the data are over-smoothed, the bias term is large and 
the variance is small. When the data are under-smoothed the opposite is 
true; see Figure 4. This is called the bias-variance tradeoff. Minimizing risk 
corresponds to balancing bias and variance. 


2.2. Nonparametric Confidence Sets. Let / be the function of in¬ 
terest, for example, the true power spectrum in the CMB example. Assume 
that f G iF where T is some very large class of functions. A valid (large 
sample) 1 — a confidence set Cn is a set Cn C E such that 

liminf inf Pr{f G Cn) > 1 — a 

n—>-oc 

where n is sample size. In words, Cn traps the true function / with prob¬ 
ability approximately 1 — a (or greater). In parametric models, confidence 
intervals take the form 0 ± 2 se where 0 is an estimate of a parameter 9 and 
se is the standard error of the estimate 9. Bayesian interval estimates take 
essentially the same form. Nonparametric confidence sets are derived in a 
different way as we shall explain later in the paper. 

If prior information is available on / then it can be included by restricting 
Cn- For example, if it is thought that / has at most three peaks and two 
dips, we replace Cn with Cn C I where X is the set of functions with no 
more than three peaks and two dips. 

Having constructed the confidence set we are then in a position to give 
confidence intervals for features of interest. We express features as functions 
of /, written T(/). For example, T(/) might denote the location of the first 
peak in /. Then 

( inf r(/), sup X(/) I 
\^/GC„ /ec„ ) 

is a I — a confidence interval for the feature T{f). In fact, we can construct 
valid, simultaneous confidence intervals for many features of interest this 
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way, once we have Cn- In section 6, we report such intervals for the CMB 
data. 

Let us dispel a common criticism about confidence intervals. An oft cited 
but useless interpretation of a 95 per cent confidence interval is: if we 
repeated the experiment many times, the interval would contain the true 
value 95 per cent of the time. This interpretation leads many researchers to 
find conhdence sets to be irrelevant since the repetitions are hypothetical. 
The correct interpretation is: if the method for constructing Cn is used 
on a stream of (unrelated) scientihc problems, we will trap the true value 
95 per cent of the time. The latter interpretation is correct and is more 
scientifically useful than the former. 


2.3. Where is the Likelihood? The likelihood function, which is a fa¬ 
miliar centerpiece of statistical inference in parametric problems, is notably 
absent in most nonparametric methods. It is possible to dehne a likelihood 
and even perform Bayesian inference in nonparametric problems. But for 
the most part, likelihood and Bayesian methods have serious drawbacks in 
nonparametric settings. See section 7 for more discussion on this point. 

3. Kernel Density Estimation. 


We now turn to problem 1, density estimation. Let us start this sec¬ 
tion with its conclusion: the choice of kernel (smoothing hlter) is relatively 
unimportant; the choice of bandwidth (smoothing parameter) is crucial; 
the optimal bandwidth can be estimated from the data. Let us now explain 
what this means. 

Let Xi,... ,Xn denote the observed data, a sample from /. The most 
commonly used density estimator is the kernel density estimator defined 

by 


Kx) = 


1 ^ 1 
-Y 

n ^ h 

i=l 


X — Xi 


where K is called the kernel and h is called the bandwidth. This amounts to 
placing a smoothed out lump of mass of size 1/n over each data point Xi. 
Excellent references on kernel density estimation include Silverman (1986) 
and Scott (1992). 

The kernel is usually assumed to be a smooth function satisfying K(x) > 
0, f xK(x)dx = 0 and t = f x^K(x)dx > 0. A fact that is well known in 
statistics but appears to be less known in astrophysics is that the choice of 
kernel AT is not crucial. The optimal kernel that minimizes risk (for large 
samples) is called the Epanechnikov kernel K(x) = .75(1 — a;^/5)/-\/5 for 
|a:| < -v/S. But the estimates using another other smooth kernel are usually 
numerically indistinguishable. This observation is conhrmed by theoretical 
calculations which show that the risk is very insensitive to the choice of 
kernel. In this paper we use the Gaussian kernel K(x) = (27r)“^/^e“^ 



What does matter is the choice of bandwidth h which controls the 
amount of smoothing. Figure 1 shows the density estimate with four differ¬ 
ent bandwidths. Here we see how sensitive the estimate / is to the choice 
of h. Small bandwidths give very rough estimates while larger bandwidths 
give smoother estimates. Statistical theory tells us that, in one dimensional 
problems, 

R{fJ) = BIAS^ -f VARIANCE 

where ci = / x^K{x)dx, C 2 = f K(x)^dx and A(f) = f (f"(x))^dx. The 
risk is minimized by taking the bandwidth equal to 

This is informative because it tells us that the best bandwidth decreases at 
rate n~^/^ and leads to risk of order Generally, one cannot find a 

nonparametric estimator that converges faster than This rate is 

close to the rate of parametric estimators, namely, 0{n~^). The difference 
between these rates is the price we pay for being nonparametric. 

The expression for /i* depends on the unknown density / which makes 
the result of little practical use. We need a data-based method for choosing 
h. The most common method for choosing a bandwidth h from the data is 
cross-validation. The idea is as follows. 

We would like to choose h to minimize the squared error f if{x) — 
f{x)Ydz = / f‘^{x)dz — 2 / f{x)f{x)dx-\- f p{x)dx. Since / p{x)dx does 
not depend on /i, this corresponds to minimizing 

J{h)= j P{x)dz-2 J f(x)f{x)dx. 

It can be shown that 

Jih)= / p{x)dz-2-^Pi{Xi). 

is an unbiased estimate of E\J{h)\, where f-i is the “leave-one-out” 
estimate obtained by omitting Xi. Some algebra shows that 

I 3 ^ ' 

where K*{x) = K^^\x) — 2K{x) and is the convolution of K with 
itself. Hence, it is not actually necessary to compute f-i. We choose the 
bandwidth h that minimizes J{h). The lower left panel of figure I was 
based on cross-validation. An important observation for large data bases 
is that (|^) can be computed quickly using the fast Fourier transform; see 
Silverman (1986, p 61-66). 
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4. Nonparametric Kernel Regression 

Returning to the regression problem, consider pairs of points {Xi,Yi),..., (X„, Yn) 
related by 


Y, = f{Xi)+ti. 


The kernel method for density estimation also works for regression. The 
estimate / is a weighted average of the points near x: f{x) = 
where the weights are given hy Wi (x K ■ This estimator is called the 

Nadaraya-Watson estimator. Figure 2 shows that estimator for the CMB 
data. Note the extreme dependence on the bandwidth h. 

Once again, we use cross-validation to choose the bandwidth h. The risk 
is estimated by 


” i=l 

The first three panels in Figure 2 show the regression data with different 
bandwidths. The second plot is based on the cross-validation bandwidth. 
The final plot shows the estimated risk J{h) from cross validation. Fig¬ 
ure 3 compares the nonparametric fit with the fit by Wang, Tegmark and 
Zaldarriaga (2001). 

Given the small sample size and the fact that we have completely ignored 
the cosmological models (as well as differential error on each data point) the 
nonparametric fit does a remarkable job. It “confirms,” nonparametrically, 
the existence of three peaks, their approximate positions and approximate 
heights. Actually, the degree to which the fit confirms the three peaks 
requires confidence statements that we discuss in section 6. 

5. Smoothing by Shrinking 

There is another approach to nonparametric estimation based on ex¬ 
panding / into an orthogonal series. The idea is to estimate the coefficients 
of the series and then “shrink” these estimates towards 0. The operation 
of shrinking is akin to smoothing. These methods have certain advantages 
over kernel smoothers. First, the problem of estimating the bandwidth is 
replaced with the problem of choosing the amount of shrinkage which is, 
arguably, supported by better statistical theory than the former. Second, it 
is easier to construct valid confidence sets for / in this framework. Third, in 
some problems one can choose the basis in a well-informed way which will 
lead to improved estimators. For example, Donoho and Johnstone (1994, 
1995) and Johnstone (this volume) show that wavelet bases can be used to 
great advantage in certain problems. 

Suppose we observe Yi = f{xi) + ti where, for simplicity, we assume that 
xi = 1/n,X 2 = 2ln,... ,Xn = 1. Further suppose that u ~ A^(0,fT^). Let 
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<j)i,(j) 2 , ■ ■ ■ be an orthonormal basis for [0,1]: 

/ (j)^j{x)dx = 1 and / (j)i{x)4’j{x)dx = 0 when i ^ j. 

Jo Jo 

For illustration, we consider the cosine basis: Jiiix) = 1, (j) 2 ix) = 
^/2 cos{'kx), (f> 2 {x) = '/2 cos{2'kx), .... Expand / in this basis: f{x) ~ 
~ Estimating / then amounts to estimat¬ 
ing the fdj's. Let Zj = Ei(/)j(i/n). It can be shown that 

Zj^N{0, 1 '7^) ) j = ^, ■ ■ ■ where 9j = ^/njdj. Once we have estimates 

9j, we set fJj = n~^/"^9j and f{x) = 

How do we estimate 9 = {9i,..., 9^) from Z = (Zi,..., Z„)? A crude 
estimate is 9j = Zj, j = I,... ,n. This leads to a very noisy (unsmoothed) 
estimate of /. Better estimates can be found by using shrinkage estima¬ 
tors. The idea - which goes back to James and Stein (1961) and Stein 
(1981) - is to estimate 9 by shrinking the vector Z closer to the origin. 
A major discovery in mathematical statistics was that careful shrinkage 
leads to estimates with much smaller risk. Following Beran (2000) we con¬ 
sider shrinkage estimators of the form 9 = (aiZi, 02 ^ 2 ) ■ ■ ■, ctnZn) where 
1 > ai > a 2 > • • • > cxn > 0 which forces more shrinkage for higher 
frequency cosine terms. 

Let a = («!,..., On) and let R{a) denote the risk of 9 using shrink¬ 
age vector a. An estimate of R{a), called Stein’s unbiased risk estimate 
(SURE), is 

■^(«) = 

j 

where cr^ has been estimated by = -^ 'TJ^=n-k+i with k < n. Using 
appropriate numerical techniques, we minimize R{a) subject to the mono¬ 
tonicity constraint. The minimizer is denoted by d and the final estimate 
is 9 = (diZi, 02 -^ 2 , • ■ •, dnZ„). Beran (2000) shows that the estimator ob¬ 
tained this way has some important optimality properties. Beran calls this 
approach REACT (Risk Estimation, Adaptation, and Coordinate Trans¬ 
formation). The estimated function / turns out to be similar to the kernel 
estimator; due to space limitations we omit the plot. 

6. Confidence Sets 


When estimating a scalar quantity 9 with an estimator 9, it is common to 
summarize the uncertainty for the estimate by reporting 0±2se where se « 

\Jvar{9) is the standard error of the estimator. Under certain regularity 
conditions, this interval is a 95 per cent confidence interval, that is, 

Pr (^9-2se < 9 < 0 -h 2 se) « .95. 

This follows because, under the conditions alluded to above, 9 « N{9, se^). 
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But the “plus or minus 2 standard errors” rule fails in nonparametric 
inference. Consider estimating a density f{x) at a single point x with a 
kernel density estimator. It turns out that 

/(x) « N (^f{x) + bias, (2) 


where 

bias = f” {x)ci 


(3) 


is the bias, Ci = / x^K{x)dx and C 2 = / K‘^{x)dx. The estimated standard 
error is 


se = 



(4) 


Observe from (|^ that {f{x) — f{x))/se 
plus/minus 2 se” rule then 

Pr (/(x) - 2se < /(x) < /(x) + 2se^ 


« iV(bias/se, 1). If use the “estimate 


= Pr 

« Pr 


-2 < 


fix) - fix) 


se 


-2< N 


\ se 



If bias/se —> 0 then this becomes Pr(—2 < iV(0,1) < 2) « .95. As we 
explained in Section 2, the optimal bandwidth is of the form h = . 

If you plug h = this into and(||) you will see that bias/se does 

not tend to 0. The confidence interval will have coverage less than .95. 
In summary, “estimate plus/minus 2 standard errors” is not appropriate in 
nonparametric inference. There are a variety of ways to deal with this prob¬ 
lem. One is to use kernels with a suboptimal bandwidth. This undersmooths 
the estimate resulting in a reduction of bias. 

Another approach is based on the REACT method (Beran and Dumbgen, 
1998). We construct a confidence set Cn for the vector of function values 
at the observed data, f„ = (/(Xi),...,/(X„)). The confidence set Cn 
satisfies: for any c > 0, 


limsup sup |Pr(f„ S Cn) — (1 — a)! ^ 0 

n—>oo ||f„||<c 


where ||a|| = \/n~^ 'Yfii ■ The supremum is important: it means that the 
accuracy of the coverage probability does not depend on the true (unknown) 
function. 

The confidence set, expressed in terms of the coefficients d, is 


Cn= {9: n-i Y^iOj - 9^)^ < Rr 


+ n-^/^TZa 
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where Za is such that P{Z > Za) = a where Z ~ A^(0,1) and f is a 
quantity computed from the data whose formula we omit here. Finally, the 
confidence set for / is 

= |/ : / = : Pj = 6 e C 

Let us return to the CMB example. We constructed a 95 per cent con¬ 
fidence set Cn, then we searched over Cn and found the possible number, 
location and heights of the peaks. We restricted the search to functions 
with no more than three peaks and two dips as it was deemed unlikely 
that the true power spectrum would have more than three peaks within 
the range of scales presently covered by the balloon experiments. Curves 
with just one or two peaks cannot be ruled out at the 95 per cent level 
i.e. they are still viable descriptions of the data but at a low statistical 
significance than three peaked models. The confidence intervals, restricted 
to three peak models, are as follows. 

Peak Location Height 

~ (118,300) (4361,8055) 

2 (377,650) (1822,4798) 

3 (597,900) (1839,4683) 

The 95 per cent confidence interval for the ratio of the height of the 
second peak divided by the height of the first peak is (.21,1.4). The 95 
per cent confidence interval for the ratio of the height of the third peak 
divided by the height of the second peak is (.22,2.82). Not surprisingly, the 
intervals are broad because the data set is small. The reader is referred to 
Miller et al (2002), for a more complete discussion of this work and our 
final results e.g. improvements in measurement error that are needed to 
get more precise confidence sets. 

7. Nonparametric Bayes 

There seems to be great interest in Bayesian methods in astrophysics. 
The reader might wonder if it is possible to perform nonparametric 
Bayesian inference. The answer is, sort of. 

Consider estimating a density / assumed to belong to some large class 
of functions such as iF = {/ : f (f"(x))^dx < C}. The “parameter” is the 
function / and the likelihood function is £«(/) = riti Maximizing 

the likelihood leads to the absurd density estimate that puts infinite spikes 
on each data point. It is possible to put a prior tt over T. The posterior 
distribution on T is well defined and Bayes theorem still holds: 

Jc Cn{f)dTT{f) 

IjrCnif)dTr{f)' 



Pr(/GC|Xi,...,X„) 



Lest this seem somewhat abstract, take note that much recent work in 
statistics lately has led to methods for computing this posterior. 

However, there is a problem. The parameter space T is infinite dimen¬ 
sional and, in such cases, the prior tt is extremely influential. The result is 
that the posterior may concentrate around the true function very slowly. 
Worse, the 95 per cent Bayesian credible sets will contain the true function 
with very low frequency. In many cases the frequency coverage probability 
of the Bayesian 95 per cent credible set is near 0! Since high dimensional 
parametric models behave like nonparametric models, these remarks should 
give us pause before casually applying Bayesian methods to parametric 
models with many parameters. 

The results that make these comments precise are fairly technical. The 
interested reader is referred to Diaconis and Freedman (1986), Barron, 
Schervish and Wasserman (1999), Ghosal, Ghosh and van der Vaart (2000), 
Freedman (2000), Zhao (2000) and Shen and Wasserman (2001). The bot¬ 
tom line: in nonparametric problems Bayesian inference is an interesting 
research area but is not (yet?) a practical tool. 

8. Conclusion 

Nonparametric methods are at their best when the sample size is large. 
The amount and quality of astrophysics data have increased dramatically 
in the last few years. For this reason, we believe that nonparametric meth¬ 
ods will play an increasingly important role in astrophysics. We have tried 
to illustrate some of the key ideas and methods here. But we have really 
only touched on a few main points. We hope through our continued in¬ 
terdisciplinary collaboration and through others like it elsewhere, that the 
development of nonparametric techniques in astrophysics will continue in 
the future. 
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Figure 1. Redshift data. Histogram and three kernel density 
estimates based on three different bandwidths. The bandwidth 
for the estimate in the lower left panel was estimated from the 
data using cross-validation. 
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Figure 2. CMB data. Section 4 explains the methods. The first 
fit is undersmoothed, the second is oversmoothed and the third 
is based on cross-validation. The last panel shows the estimated 
risk versus the bandwidth of the smoother. The data are from 
BOOMERANG, Maxima and DASI. 
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Figure 4. The Bias-Variance tradeoff. The bias increases and the 
variance decreases with the amount of smoothing. The optimal 
amount of smoothing, indicated by the vertical line, minimizes 
the risk = bias^ -I- variance. 





